I recently grabbed a copy of the AOL search data that was released earlier this month. The dataset is rather large: 450MB compressed, and over 1.4GB uncompressed once loaded into MySQL.
This data has been the subject of much controversy and has even resulted in at least three people getting fired.
Anyhow, I’ve been parsing through the dataset and there’s some interesting stuff in there.
I’ll keep this thread updated with the information that I find…
- People use the search bar to type in website names a lot more than you might expect. Of the more than 17 million queries, almost 3.5 million (roughly one in five) contain .com, .gov, .edu, or .org.
- People refine their searches as they go along:
  - dr shermam
  - dr sherman longwood
  - dr sherman vision therapy
  - dr sherman vision therapy fl
- 6;p6p5p56ptpptptptppprprpprpprprprp…
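For anyone who wants to try the website-name count on their own copy, here is a rough sketch in Python. It assumes the queries have already been pulled out into a list of strings (my data lives in MySQL, so this is not the exact query I ran), and the function names and sample queries are made up for illustration. It does a plain case-insensitive substring match, so something like "computer.commands" would also be counted.

```python
# Suffixes mentioned in the post
SUFFIXES = (".com", ".gov", ".edu", ".org")

def looks_like_site(query: str) -> bool:
    """Return True if the query contains any of the suffixes (substring match)."""
    q = query.lower()
    return any(suffix in q for suffix in SUFFIXES)

def count_site_queries(queries) -> int:
    """Count how many queries contain at least one of the suffixes."""
    return sum(1 for q in queries if looks_like_site(q))

# Hypothetical sample, not real rows from the dataset
sample = [
    "www.google.com",
    "dr sherman vision therapy",
    "irs.gov forms",
    "MYSPACE.COM",
]
print(count_site_queries(sample))
```

A stricter version could match the suffix only at the end of a word, which would cut down on false positives.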
I’m also running an analysis on the words in the queries. I’m curious how many one-word, two-word, three-word, and four-word queries there are. It might not amount to anything, but it’ll be interesting to look at 🙂 I expect it to take quite a while to break all this data down, so there probably won’t be any updates until the end of the week.
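The word-count breakdown can be sketched along these lines in Python. This is only an illustration under my own assumptions: the queries are a list of strings, a "word" is anything separated by whitespace, and the function name is hypothetical. The sample uses the refinement queries from the list above.

```python
from collections import Counter

def word_count_histogram(queries):
    """Map 'number of whitespace-separated words' to 'number of queries'."""
    return Counter(len(q.split()) for q in queries)

# The refinement example from earlier in the post
sample = [
    "dr shermam",
    "dr sherman longwood",
    "dr sherman vision therapy",
    "dr sherman vision therapy fl",
]

hist = word_count_histogram(sample)
for n_words in sorted(hist):
    print(n_words, "word(s):", hist[n_words])
```

On the full dataset the same idea would need to stream rows rather than hold 17 million strings in memory, but the counting logic stays the same.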
G-Man